Project¶
Netflix Data: Cleaning, Analysis and Visualization¶
--A Complete End-to-End Analysis of Netflix Titles Using Python--
Domain: Data Analyst & Data Scientist
Introduction:¶
This project focuses on analyzing the Netflix Movies and TV Shows dataset using Python. The goal is to clean the data, explore key patterns, and generate meaningful insights about Netflix content. Through data cleaning, exploratory data analysis (EDA), and feature engineering, we understand trends like content types, durations, release patterns, ratings, and genre combinations. This helps us clearly see how Netflix has evolved over the years and what kind of content it mostly provides.
Objectives:¶
The main objective of this project is to clean and analyse the Netflix dataset to understand the type of content available on the platform. I want to explore how many movies and TV shows there are, how their durations vary, which genres are common, which ratings are popular, and how Netflix has added content over time. I also created new features (number of genres, content age, and a numeric duration) to get deeper insights. Overall, the goal is to understand Netflix content trends using data cleaning, EDA, feature engineering, and visualizations.
Tools:¶
Python – Main language for data cleaning and analysis.
Pandas – Used for data manipulation and cleaning.
NumPy – Used for numerical operations.
Matplotlib – Used to create visualizations.
Seaborn – Used for advanced and stylish charts.
YData Profiling – Used to generate an automatic data profiling report.
Dataset Description:¶
1. show_id- Unique ID given to each Netflix title.
2. type- Specifies whether the content is a Movie or TV Show.
3. title- Name of the movie or TV show.
4. director- Name of the director(s). Some titles may have empty values if the director is not available.
5. country- Country/countries where the title was produced.
6. date_added- The exact date when the movie or show was added to Netflix.
7. release_year- The year in which the movie/show was originally released.
8. rating- Content rating like TV-MA, TV-14, PG, R, etc., which shows age eligibility.
9. duration- For movies, the duration in minutes; for TV shows, the number of seasons.
10. listed_in- Genres/categories of the title. Multiple genres are separated by commas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Data Loading:¶
df= pd.read_csv(r"C:\Users\sathv\Downloads\netflix1 (2).csv")
df.head()
| | show_id | type | title | director | country | date_added | release_year | rating | duration | listed_in |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | United States | 9/25/2021 | 2020 | PG-13 | 90 min | Documentaries |
| 1 | s3 | TV Show | Ganglands | Julien Leclercq | France | 9/24/2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... |
| 2 | s6 | TV Show | Midnight Mass | Mike Flanagan | United States | 9/24/2021 | 2021 | TV-MA | 1 Season | TV Dramas, TV Horror, TV Mysteries |
| 3 | s14 | Movie | Confessions of an Invisible Girl | Bruno Garotti | Brazil | 9/22/2021 | 2021 | TV-PG | 91 min | Children & Family Movies, Comedies |
| 4 | s8 | Movie | Sankofa | Haile Gerima | United States | 9/24/2021 | 1993 | TV-MA | 125 min | Dramas, Independent Movies, International Movies |
df.describe()
| | release_year |
|---|---|
| count | 8790.000000 |
| mean | 2014.183163 |
| std | 8.825466 |
| min | 1925.000000 |
| 25% | 2013.000000 |
| 50% | 2017.000000 |
| 75% | 2019.000000 |
| max | 2021.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   country       8790 non-null   object
 5   date_added    8790 non-null   object
 6   release_year  8790 non-null   int64
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB
df.isnull().sum()
show_id         0
type            0
title           0
director        0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
dtype: int64
Checking how many Rows & Columns¶
df.shape
(8790, 10)
Remove duplicates:¶
df.drop_duplicates(inplace=True)
df['date_added'] = pd.to_datetime(df['date_added'])
df['year'] = df['date_added'].dt.year
df['month'] = df['date_added'].dt.month
df['day'] = df['date_added'].dt.day
df[['date_added', 'year', 'month', 'day']].head()
| | date_added | year | month | day |
|---|---|---|---|---|
| 0 | 2021-09-25 | 2021 | 9 | 25 |
| 1 | 2021-09-24 | 2021 | 9 | 24 |
| 2 | 2021-09-24 | 2021 | 9 | 24 |
| 3 | 2021-09-22 | 2021 | 9 | 22 |
| 4 | 2021-09-24 | 2021 | 9 | 24 |
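The date conversion above relies on pandas inferring the format. A minimal sketch of a more defensive variant is shown below, assuming the dates are always written as month/day/year as in the preview rows; the sample frame and its malformed value are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately malformed.
sample = pd.DataFrame({"date_added": ["9/25/2021", "9/24/2021", "not a date"]})

# An explicit format avoids day/month ambiguity, and errors="coerce"
# turns unparseable strings into NaT instead of raising an exception.
parsed = pd.to_datetime(sample["date_added"], format="%m/%d/%Y", errors="coerce")

# Count the rows that could not be parsed so they can be inspected.
n_bad = int(parsed.isna().sum())
first = parsed.iloc[0]

print(first.year, first.month, first.day)
print("unparseable dates:", n_bad)
```

Counting the coerced values before calling dropna() makes it visible how many rows would be lost to bad dates.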
df = df.dropna()
df.dtypes
show_id                 object
type                    object
title                   object
director                object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
year                     int32
month                    int32
day                      int32
dtype: object
Installing and importing ydata_profiling¶
pip install ydata-profiling
Requirement already satisfied: ydata-profiling in c:\users\sathv\anaconda3\lib\site-packages (4.17.0)
Note: you may need to restart the kernel to use updated packages.
from ydata_profiling import ProfileReport
profile = ProfileReport(df)
profile
After completing the data cleaning process, I generated a data profiling report using YData Profiling. This report helped me understand the cleaned Netflix dataset in a detailed and easy way.
From the profiling report, I observed the following:
--All duplicate rows were removed during cleaning.
--No missing values remain in any column, including director and country.
--The duration column has mixed formats like “90 min” and “1 Season”, which means I need to do extra feature engineering later.
--date_added was converted correctly into datetime format during cleaning.
--Most of the columns are categorical and have high variety, especially title and listed_in.
--There are no numeric correlations because almost all columns are text-based.
--Both Movies and TV Shows are well represented in the data.
Overall, the profiling report confirmed that the dataset is properly cleaned and ready for visualization and deeper EDA. It also helped me decide what transformations and analysis to perform next.
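The duplicate and missing-value observations can also be checked directly in pandas, without the profiling report. The sketch below uses a small made-up frame (the column values are illustrative only) to show the checks.

```python
import pandas as pd

# Hypothetical sample: one exact duplicate row and one missing director.
sample = pd.DataFrame({
    "show_id": ["s1", "s3", "s3", "s6"],
    "director": ["Kirsten Johnson", "Julien Leclercq", "Julien Leclercq", None],
})

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(sample.duplicated().sum())

# Missing values per column.
missing_per_column = sample.isna().sum()

# Same cleaning steps as in the notebook: drop duplicates, then drop NaNs.
cleaned = sample.drop_duplicates().dropna()

print("duplicates:", n_duplicates)
print(missing_per_column)
print("rows after cleaning:", len(cleaned))
```

Running the same two checks on the cleaned frame should report zero duplicates and zero missing values.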
Exploratory Data Analysis¶
1. Movie vs TV Show Distribution¶
%matplotlib inline
type_counts = df['type'].value_counts()
plt.figure(figsize=(4,4))
sns.barplot(x=type_counts.index, y=type_counts.values,color="brown")
plt.title("Movies vs TV Shows Distribution")
plt.xlabel("Type")
plt.ylabel("Count")
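# Save before show(): with the inline backend the figure is closed after show(), so saving afterwards writes a blank image
plt.savefig('Movie_vs_TV_Show.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()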
This chart shows that Netflix has more movies than TV shows. Far more movies than shows have been added, indicating that movies dominate the platform.
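To put a number on how strongly one type dominates, the counts can be turned into percentages. This is a minimal sketch on a made-up series, not the real dataset, so the printed shares are illustrative only.

```python
import pandas as pd

# Hypothetical sample of the 'type' column.
types = pd.Series(["Movie", "Movie", "Movie", "TV Show"], name="type")

# normalize=True returns fractions that sum to 1; multiply for percentages.
shares = types.value_counts(normalize=True) * 100

print(shares.round(1))
```

On the real data the same call would be applied to df['type'].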
2. Movie Duration Analysis¶
# Filter only Movies
movies = df[df['type'] == "Movie"].copy()
# Convert "90 min" → 90
movies['duration_minutes'] = movies['duration'].str.replace(" min", "").astype(int)
plt.figure(figsize=(8,5))
plt.hist(movies['duration_minutes'], bins=20, color="brown")
plt.title("Distribution of Movie Duration on Netflix")
plt.xlabel("Duration (Minutes)")
plt.ylabel("Frequency")
plt.grid(True)
plt.savefig('Movie_Duration.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
Most movies on Netflix fall between 80 and 120 minutes. Very long movies are rare, and short films appear less frequently. The distribution is roughly bell-shaped.
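The 80 to 120 minute claim can be checked numerically rather than read off the histogram. The sketch below uses plain Python on a handful of made-up duration strings; the helper name to_minutes is hypothetical.

```python
import re
from statistics import median

# Hypothetical sample of movie durations in the dataset's "N min" format.
durations = ["90 min", "91 min", "125 min", "45 min", "110 min"]


def to_minutes(text):
    """Return the leading integer of a string such as '90 min'."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group())


minutes = [to_minutes(d) for d in durations]

# Share of movies whose runtime falls in the 80-120 minute band (inclusive).
in_range = [m for m in minutes if 80 <= m <= 120]
share_in_range = len(in_range) / len(minutes)
median_minutes = median(minutes)

print(f"share in 80-120 min: {share_in_range:.0%}")
print("median runtime:", median_minutes)
```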
3. Countries With Longest Average Movie Duration¶
# Group by country and calculate average duration
avg_duration_country = movies.groupby('country')['duration_minutes'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(10,6))
avg_duration_country.plot(kind='barh',color="brown")
plt.title("Top 10 Countries With Longest Average Movie Duration")
plt.xlabel("Average Duration (Minutes)")
plt.ylabel("Country")
plt.savefig('top_10_countries.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
Some countries, such as South Korea, Cambodia, and Somalia, have longer movies on average. This helps identify regions where movies tend to have longer runtimes.
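A country with only one or two titles can top an average-based ranking by chance. One possible refinement, sketched here on made-up data, is to require a minimum number of movies per country before ranking; the threshold min_titles is an assumption, not something used in the original analysis.

```python
import pandas as pd

# Hypothetical sample: country B has a single, very long movie.
movies_sample = pd.DataFrame({
    "country": ["A", "A", "B", "C", "C", "C"],
    "duration_minutes": [100, 120, 200, 90, 95, 100],
})

min_titles = 2  # assumed threshold; tune to the real data

# Compute mean runtime and title count per country in one pass.
stats = movies_sample.groupby("country")["duration_minutes"].agg(["mean", "count"])

# Keep only countries with enough titles, then rank by mean runtime.
ranked = stats[stats["count"] >= min_titles].sort_values("mean", ascending=False)

print(ranked)
```

Country B is dropped from the ranking even though its single movie has the longest runtime.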
4. Seasonal Trend¶
#Which months add the most new content
monthly_add = df['month'].value_counts().sort_index()
plt.plot(monthly_add.index, monthly_add.values,color="brown")
plt.xticks(range(1,13))
plt.title("Content Added by Month")
plt.xlabel("Month")
plt.ylabel("Count")
plt.grid(True)
plt.savefig('seasonal_trend.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
Netflix adds content throughout the year, but some months, such as July, August, and December, show higher additions. This reveals seasonal upload patterns.
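Month numbers on the x-axis are harder to read than names. A small sketch of mapping month numbers to abbreviated names and listing the busiest months follows; the month values are made up, and the names assume the default (English) locale.

```python
import calendar
from collections import Counter

# Hypothetical sample of the 'month' column (1 = January ... 12 = December).
months = [7, 7, 12, 1, 7, 12]

# calendar.month_abbr maps 1..12 to 'Jan'..'Dec' in the default locale.
counts = Counter(calendar.month_abbr[m] for m in months)

# The two months with the most additions, highest first.
top_months = counts.most_common(2)

print(top_months)
```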
5. Which genre combinations are most common¶
genre_combinations = df['listed_in'].value_counts().head(10)
plt.figure(figsize=(10,6))
sns.barplot(x=genre_combinations.values, y=genre_combinations.index,color="brown")
plt.title("Most Common Genre Combinations on Netflix")
plt.xlabel("Count")
plt.ylabel("Genre Combination")
plt.savefig('most_common_genre.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
The bar chart shows that Dramas, International Movies, and Documentaries are the most common genre combinations on Netflix, each appearing in high counts (around 350 titles). This indicates Netflix strongly focuses on dramatic and international content, along with documentary-style shows and movies.
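Counting whole listed_in strings measures genre combinations, which is different from counting individual genres. The sketch below, on made-up rows, shows both counts side by side so the two are not confused.

```python
from collections import Counter

# Hypothetical sample of the 'listed_in' column.
listed_in = [
    "Dramas, International Movies",
    "Dramas, International Movies",
    "Documentaries",
    "Dramas, Independent Movies, International Movies",
]

# Combination counts: each full string is one key (what value_counts does).
combo_counts = Counter(listed_in)

# Individual genre counts: split on commas and strip surrounding spaces.
genre_counts = Counter(
    genre.strip() for entry in listed_in for genre in entry.split(",")
)

print(combo_counts.most_common(1))
print(genre_counts.most_common(2))
```

A genre can rank high individually while no single combination containing it is especially common, so the two views answer different questions.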
6. How many Netflix titles were released after 2015?¶
# Netflix-inspired palette, defined before the charts that use it
netflix_colors = ['#E50914', # Netflix Red
                  '#221F1F', # Netflix Black
                  '#F5F5F1', # Off-White
                  '#B81D24', # Dark Red
                  '#8B0000', # Deep Red
                  '#4A4A4A', # Grey
                  '#CCCCCC', # Light Grey
                  '#000000'] # Pure Black
recent = (df['release_year'] > 2015).sum()
older = (df['release_year'] <= 2015).sum()
plt.figure(figsize=(6,3))
plt.pie([recent, older], labels=['After 2015', '2015 or earlier'],
        autopct='%.1f%%', colors=netflix_colors)
plt.title("Modern vs Older Content on Netflix")
plt.savefig('Modern_vs_Older_Content.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
Most of Netflix's catalog was released after 2015. This means the platform mainly focuses on modern and recently produced content.
7. Rating Distribution¶
rating_counts = df['rating'].value_counts().head(8)
plt.figure(figsize=(6,3))
plt.pie(rating_counts.values, labels=rating_counts.index, autopct='%.1f%%', colors=netflix_colors)
plt.title("Rating Distribution on Netflix")
plt.savefig('rating_distribution.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
TV-MA and TV-14 are the most common ratings, showing that Netflix has a lot of mature and teen-friendly content. Other ratings appear in smaller proportions.
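Because the pie chart keeps only the top 8 ratings, its percentages are computed over those 8 rather than over the whole catalog. One way to keep the shares honest, sketched below on made-up ratings, is to fold everything outside the top N into an "Other" slice; the helper name top_n_with_other is hypothetical.

```python
from collections import Counter


def top_n_with_other(values, n):
    """Return counts for the n most common values plus an 'Other' bucket."""
    counts = Counter(values)
    top = counts.most_common(n)
    # Everything not in the top n is summed into a single slice.
    other = sum(counts.values()) - sum(count for _, count in top)
    slices = dict(top)
    if other:
        slices["Other"] = other
    return slices


# Hypothetical sample of the 'rating' column.
ratings = ["TV-MA"] * 4 + ["TV-14"] * 3 + ["PG"] * 2 + ["R"]

slices = top_n_with_other(ratings, n=2)
print(slices)
```

The slice values still sum to the total number of titles, so the percentages on the chart describe the full dataset.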
before_2010 = (df['release_year'] < 2010).sum()
after_2010 = (df['release_year'] >= 2010).sum()
plt.figure(figsize=(6,3))
plt.pie([before_2010, after_2010],
        labels=['Before 2010', '2010 and later'],
        autopct='%.1f%%', colors=netflix_colors)
plt.title("Netflix Content Split by Release Year")
plt.savefig('Modern_vs_Older_Content(2010).png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
The pie chart shows that most Netflix titles (around 84.8%) were released in 2010 or later, while only 15.2% were released before 2010. This means Netflix mainly features modern and recent content.
df['num_genres'] = df['listed_in'].apply(lambda x: len(x.split(',')))
df[['listed_in', 'num_genres']].head()
| | listed_in | num_genres |
|---|---|---|
| 0 | Documentaries | 1 |
| 1 | Crime TV Shows, International TV Shows, TV Act... | 3 |
| 2 | TV Dramas, TV Horror, TV Mysteries | 3 |
| 3 | Children & Family Movies, Comedies | 2 |
| 4 | Dramas, Independent Movies, International Movies | 3 |
plt.figure(figsize=(4,4))
sns.countplot(x='num_genres', data=df, color=netflix_colors[0])
plt.title("Distribution of Number of Genres per Title")
plt.xlabel("Number of Genres")
plt.ylabel("Count")
plt.savefig('Count_number_of_genres.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
This chart shows that most Netflix titles belong to 3 genres, followed by 2 genres, while very few titles have only 1 genre. This means Netflix content usually falls under multiple categories.
2: Calculate how old each title is¶
current_year = 2024
df['content_age'] = current_year - df['release_year']
# sample output
df[['title', 'release_year', 'content_age']].head()
| | title | release_year | content_age |
|---|---|---|---|
| 0 | Dick Johnson Is Dead | 2020 | 4 |
| 1 | Ganglands | 2021 | 3 |
| 2 | Midnight Mass | 2021 | 3 |
| 3 | Confessions of an Invisible Girl | 2021 | 3 |
| 4 | Sankofa | 1993 | 31 |
plt.figure(figsize=(6,4))
plt.hist(df['content_age'], bins=20, color='#E50914')
plt.title("Distribution of Content Age (Years Since Release)")
plt.xlabel("Content Age (Years)")
plt.ylabel("Frequency")
plt.grid(True)
plt.savefig('distribution_of_content_age.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
Most Netflix titles are very recent, with the majority released within the last 10 years. Very old titles are less common, which shows Netflix mainly provides modern content.
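The content age depends on the reference year, which is fixed at 2024 in the code above. A small sketch of making that reference explicit is shown below; the function name content_age is hypothetical, and falling back to today's year is an assumption about what a reader might want when re-running the notebook later.

```python
from datetime import date


def content_age(release_year, reference_year=None):
    """Years between release and a reference year (defaults to this year)."""
    if reference_year is None:
        reference_year = date.today().year
    return reference_year - release_year


# With 2024 as the reference, these match the sample table above.
print(content_age(1993, reference_year=2024))
print(content_age(2020, reference_year=2024))
```

Passing the reference year explicitly keeps saved results reproducible, while the default keeps the ages current on a fresh run.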
3: Extract numeric value from duration¶
# Raw string avoids an invalid-escape warning; expand=False returns a Series
df['duration_number'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(int)
# Sample output
df[['type', 'duration', 'duration_number']].head()
| | type | duration | duration_number |
|---|---|---|---|
| 0 | Movie | 90 min | 90 |
| 1 | TV Show | 1 Season | 1 |
| 2 | TV Show | 1 Season | 1 |
| 3 | Movie | 91 min | 91 |
| 4 | Movie | 125 min | 125 |
# Group by content type and calculate average duration_number
avg_duration = df.groupby('type')['duration_number'].mean()
# Plot
plt.figure(figsize=(4,4))
avg_duration.plot(kind='bar', color=netflix_colors[0])
plt.title("Average Duration of Movies vs TV Shows")
plt.ylabel("Average Duration (Minutes / Seasons)")
plt.xlabel("Type")
plt.savefig('avg_duration_of_movies_vs_TV_shows.png', bbox_inches='tight',dpi=300)
plt.show()
plt.close()
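Movies have a much higher average duration (around 100 min), while TV Shows usually have only 1-2 seasons on average. The two bars use different units (minutes vs. seasons), so they are not directly comparable in size, but they show a clear difference in how Netflix structures both content types.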
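Since minutes and seasons are different units, it can help to keep the unit next to the number when parsing the duration column. The following is a minimal sketch in plain Python, assuming the only formats are "N min", "N Season" and "N Seasons" as seen in the preview; the helper name parse_duration is hypothetical.

```python
import re

# Matches strings such as "90 min", "1 Season" or "2 Seasons".
_DURATION = re.compile(r"^\s*(\d+)\s+(min|Seasons?)\s*$")


def parse_duration(text):
    """Split a duration string into (value, unit), unit being 'min' or 'season'."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    value = int(match.group(1))
    unit = "min" if match.group(2) == "min" else "season"
    return value, unit


for raw in ["90 min", "1 Season", "2 Seasons"]:
    print(raw, "->", parse_duration(raw))
```

With the unit available, movies and TV shows can be averaged and plotted separately instead of sharing one axis.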
Exporting Final Cleaned Dataset¶
df.to_csv("netflix_cleaned.csv",index=False)
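One thing to keep in mind when reloading the exported file: CSV stores dates as text, so date_added comes back as a string column unless it is parsed again. The sketch below demonstrates this with an in-memory buffer and a one-row made-up frame instead of the real file.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a datetime column, as after cleaning.
sample = pd.DataFrame({
    "title": ["Sankofa"],
    "date_added": pd.to_datetime(["2021-09-24"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Reading back without parse_dates leaves the column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain["date_added"].dtype, restored["date_added"].dtype)
```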
Key Insights from the Analysis:¶
--Netflix has more movies than TV Shows.
--Most content was released after 2015.
--TV-MA and TV-14 are the most common ratings.
--Most movies are 80-120 minutes long.
--Genres like Dramas, International Movies, and Documentaries appear the most.
--Content is mostly recent, under 10 years old.
Conclusion:¶
This project helped me understand Netflix content by cleaning the data, doing EDA, and creating new features. I found that Netflix has more movies than TV shows, most titles were released after 2010, and genres like Dramas and International Movies are the most common. Overall, the dataset shows that Netflix mainly focuses on modern and multi-genre content.